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Abstract 

Sparse  coding — that  is,  modelling  data  vectors  as 
sparse  linear  combinations  of  basis  elements — is 
widely  used  in  machine  learning,  neuroscience, 
signal  processing,  and  statistics.  This  paper  fo¬ 
cuses  on  learning  the  basis  set,  also  called  dic¬ 
tionary,  to  adapt  it  to  specific  data,  an  approach 
that  has  recently  proven  to  be  very  effective  for 
signal  reconstruction  and  classification  in  the  au¬ 
dio  and  image  processing  domains.  This  paper 
proposes  a  new  online  optimization  algorithm 
for  dictionary  learning,  based  on  stochastic  ap¬ 
proximations,  which  scales  up  gracefully  to  large 
datasets  with  millions  of  training  samples.  A 
proof  of  convergence  is  presented,  along  with 
experiments  with  natural  images  demonstrating 
that  it  leads  to  faster  performance  and  better  dic¬ 
tionaries  than  classical  batch  algorithms  for  both 
small  and  large  datasets. 

1.  Introduction 

The  linear  decomposition  of  a  signal  using  a  few  atoms  of 
a  learned  dictionary  instead  of  a  predefined  one — based  on 
wavelets  (Mallat,  1999)  for  example — has  recently  led  to 
state-of-the-art  results  for  numerous  low-level  image  pro¬ 
cessing  tasks  such  as  denoising  (Elad  &  Aharon,  2006) 
as  well  as  higher-level  tasks  such  as  classification  (Raina 
et  ah,  2007;  Mairal  et  ah,  2008a),  showing  that  sparse 
learned  models  are  well  adapted  to  natural  signals.  Un- 
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like  decompositions  based  on  principal  component  analy¬ 
sis  and  its  variants,  these  models  do  not  impose  that  the 
basis  vectors  be  orthogonal,  allowing  more  flexibility  to 
adapt  the  representation  to  the  data.  While  learning  the 
dictionary  has  proven  to  be  critical  to  achieve  (or  improve 
upon)  state-of-the-art  results,  effectively  solving  the  cor¬ 
responding  optimization  problem  is  a  significant  compu¬ 
tational  challenge,  particularly  in  the  context  of  the  large- 
scale  datasets  involved  in  image  processing  tasks,  that  may 
include  millions  of  training  samples.  Addressing  this  chal¬ 
lenge  is  the  topic  of  this  paper. 

Concretely,  consider  a  signal  x  in  K™.  We  say  that  it  ad¬ 
mits  a  sparse  approximation  over  a  dictionary  D  in 
with  k  columns  referred  to  as  atoms,  when  one  can  find  a 
linear  combination  of  a  “few”  atoms  from  D  that  is  “close” 
to  the  signal  x.  Experiments  have  shown  that  modelling  a 
signal  with  such  a  sparse  decomposition  (sparse  coding)  is 
very  effective  in  many  signal  processing  applications  (Chen 
et  ah,  1999).  Eor  natural  images,  predefined  dictionaries 
based  on  various  types  of  wavelets  (Mallat,  1999)  have 
been  used  for  this  task.  However,  learning  the  dictionary 
instead  of  using  off-the-shelf  bases  has  been  shown  to  dra¬ 
matically  improve  signal  reconstruction  (Elad  &  Aharon, 
2006).  Although  some  of  the  learned  dictionary  elements 
may  sometimes  “look  like”  wavelets  (or  Gabor  filters),  they 
are  tuned  to  the  input  images  or  signals,  leading  to  much 
better  results  in  practice. 

Most  recent  algorithms  for  dictionary  learning  (Engan 
et  ah,  1999;  Aharon  et  ah,  2006;  Lee  et  ah,  2007) 
are  second-order  iterative  batch  procedures,  accessing  the 
whole  training  set  at  each  iteration  in  order  to  minimize  a 
cost  function  under  some  constraints.  Although  they  have 
shown  experimentally  to  be  much  faster  than  first-order 
gradient  descent  methods  (Lee  et  ah,  2007),  they  cannot 
effectively  handle  very  large  training  sets  (Bottou  &  Bous- 
quet,  2007),  or  dynamic  training  data  changing  over  time. 
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such  as  video  sequences.  To  address  these  issues,  we  pro¬ 
pose  an  online  approach  that  processes  one  element  (or  a 
small  subset)  of  the  training  set  at  a  time.  This  is  particu¬ 
larly  important  in  the  context  of  image  and  video  process¬ 
ing  (Protter  &  Elad,  2009),  where  it  is  common  to  learn 
dictionaries  adapted  to  small  patches,  with  training  data 
that  may  include  several  millions  of  these  patches  (roughly 
one  per  pixel  and  per  frame).  In  this  setting,  online  tech¬ 
niques  based  on  stochastic  approximations  are  an  attractive 
alternative  to  batch  methods  (Bottou,  1998).  For  exam¬ 
ple,  first-order  stochastic  gradient  descent  with  projections 
on  the  constraint  set  (Kushner  &  Yin,  2003)  is  sometimes 
used  for  dictionary  learning  (see  Aharon  and  Elad  (2008) 
for  instance).  We  show  in  this  paper  that  it  is  possible  to 
go  further  and  exploit  the  specific  structure  of  sparse  cod¬ 
ing  in  the  design  of  an  optimization  procedure  dedicated 
to  the  problem  of  dictionary  learning,  with  low  memory 
consumption  and  lower  computational  cost  than  classical 
second-order  batch  algorithms  and  without  the  need  of  ex¬ 
plicit  learning  rate  tuning.  As  demonstrated  by  our  exper¬ 
iments,  the  algorithm  scales  up  gracefully  to  large  datasets 
with  millions  of  training  samples,  and  it  is  usually  faster 
than  more  standard  methods. 

1.1.  Contributions 

This  paper  makes  three  main  contributions. 

•  We  cast  in  Section  2  the  dictionary  learning  problem  as 
the  optimization  of  a  smooth  nonconvex  objective  function 
over  a  convex  set,  minimizing  the  (desired)  expected  cost 
when  the  training  set  size  goes  to  infinity. 

•  We  propose  in  Section  3  an  iterative  online  algorithm  that 
solves  this  problem  by  efficiently  minimizing  at  each  step  a 
quadratic  surrogate  function  of  the  empirical  cost  over  the 
set  of  constraints.  This  method  is  shown  in  Section  4  to 
converge  with  probability  one  to  a  stationary  point  of  the 
cost  function. 

•  As  shown  experimentally  in  Section  5,  our  algorithm  is 
significantly  faster  than  previous  approaches  to  dictionary 
learning  on  both  small  and  large  datasets  of  natural  im¬ 
ages.  To  demonstrate  that  it  is  adapted  to  difficult,  large- 
scale  image-processing  tasks,  we  leam  a  dictionary  on  a 
12-Megapixel  photograph  and  use  it  for  inpainting. 

2.  Problem  Statement 

Classical  dictionary  learning  techniques  (Olshausen  & 
Field,  1997;  Engan  et  al.,  1999;  Aharon  et  al.,  2006; 
Lee  et  al.,  2007)  consider  a  finite  training  set  of  signals 
X  =  [xi , . . . ,  x„]  in  R™  ^  "  and  optimize  the  empirical  cost 
function 

1  ” 

/„(D)^ -^Z(x„D),  (1) 

n 


where  D  in  ^  ^  is  the  dictionary,  each  column  represent¬ 
ing  a  basis  vector,  and  I  is  a  loss  function  such  that  Z(x,  D) 
should  be  small  if  D  is  “good”  at  representing  the  signal  x. 
The  number  of  samples  n  is  usually  large,  whereas  the  sig¬ 
nal  dimension  m  is  relatively  small,  for  example,  m  =  100 
for  10  X  10  image  patches,  and  n  >  100,  000  for  typical 
image  processing  applications.  In  general,  we  also  have 
k  n  (e.g.,  k  =  200  for  n  =  100,  000),  and  each  signal 
only  uses  a  few  elements  of  D  in  its  representation.  Note 
that,  in  this  setting,  overcomplete  dictionaries  with  k  >  m 
are  allowed.  As  others  (see  (Lee  et  al.,  2007)  for  example), 
we  define  Z(x,  D)  as  the  optimal  value  of  the  ii-sparse  cod¬ 
ing  problem: 

Z(x,D)=  min  ^||x- Da||2 -t- A||a||i,  (2) 

where  A  is  a  regularization  parameter.^  This  problem  is 
also  known  as  basis  pursuit  (Chen  et  al.,  1999),  or  the 
Lasso  (Tibshirani,  1996).  It  is  well  known  that  the 
penalty  yields  a  sparse  solution  for  a,  but  there  is  no  an¬ 
alytic  link  between  the  value  of  A  and  the  corresponding 
effective  sparsity  |  |q:|  |o-  To  prevent  D  from  being  arbitrar¬ 
ily  large  (which  would  lead  to  arbitrarily  small  values  of 
a.),  it  is  common  to  constrain  its  columns  to  have 

an  £2  norm  less  than  or  equal  to  one.  We  will  call  C  the 
convex  set  of  matrices  verifying  this  constraint: 

C  =  {D  €  s.t.  Vj  =  1, . . . ,  fc,  djdj-  <  1}.  (3) 

Note  that  the  problem  of  minimizing  the  empirical  cost 
/„(D)  is  not  convex  with  respect  to  D.  It  can  be  rewrit¬ 
ten  as  a  joint  optimization  problem  with  respect  to  the  dic¬ 
tionary  D  and  the  coefficients  ot  =  [ai, . . . ,  q:„]  of  the 
sparse  decomposition,  which  is  not  jointly  convex,  but  con¬ 
vex  with  respect  to  each  of  the  two  variables  D  and  ot  when 
the  other  one  is  fixed: 

min  -  V  (-||xi -Daj||2 -f  A||q:*||i).  (4) 

DGC.aeR''X"  n  ^  V2  / 

1—1 

A  natural  approach  to  solving  this  problem  is  to  alter¬ 
nate  between  the  two  variables,  minimizing  over  one  while 
keeping  the  other  one  fixed,  as  proposed  by  Lee  et  al. 
(2007)  (see  also  Engan  et  al.  (1999)  and  Aharon  et  al. 
(2006),  who  use  £0  rather  than  £i  penalties,  for  related  ap¬ 
proaches).^  Since  the  computation  of  ot  dominates  the  cost 
of  each  iteration,  a  second-order  optimization  technique 

^The  £p  norm  of  a  vector  x  in  R™'  is  defined,  for  p  >  1,  by 

1 1^1  Ip  =  Following  tradition,  we  denote  by 

I  |x|  |o  the  number  of  nonzero  elements  of  the  vector  x.  This  “fo” 
sparsity  measure  is  not  a  tme  norm. 

^In  our  setting,  as  in  (Lee  et  al.,  2007),  we  use  the  convex  £\ 
norm,  that  has  empirically  proven  to  be  better  behaved  in  general 
than  the  fo  pseudo-norm  for  dictionary  learning. 
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can  be  used  in  this  case  to  accurately  estimate  D  at  each 
step  when  a  is  fixed. 

As  pointed  out  by  Bottou  and  Bousquet  (2007),  however, 
one  is  usually  not  interested  in  a  perfect  minimization  of 
the  empirical  cost  /„(D),  but  in  the  minimization  of  the 
expected  cost 

/(D)  =Ex[((x,D)]  =  lim  /„(D)  a.s.,  (5) 

where  the  expectation  (which  is  assumed  finite)  is  taken  rel¬ 
ative  to  the  (unknown)  probability  distribution  p(x)  of  the 
data.^  In  particular,  given  a  finite  training  set,  one  should 
not  spend  too  much  effort  on  accurately  minimizing  the 
empirical  cost,  since  it  is  only  an  approximation  of  the  ex¬ 
pected  cost. 

Bottou  and  Bousquet  (2007)  have  further  shown  both  the¬ 
oretically  and  experimentally  that  stochastic  gradient  algo¬ 
rithms,  whose  rate  of  convergence  is  not  good  in  conven¬ 
tional  optimization  terms,  may  in  fact  in  certain  settings  he 
the  fastest  in  reaching  a  solution  with  low  expected  cost. 
With  large  training  sets,  classical  batch  optimization  tech¬ 
niques  may  indeed  become  impractical  in  terms  of  speed  or 
memory  requirements. 


In  the  case  of  dictionary  learning,  classical  projected  first- 
order  stochastic  gradient  descent  (as  used  hy  Aharon  and 
Elad  (2008)  for  instance)  consists  of  a  sequence  of  updates 
ofD: 


Di 


He 


Dt_i  —  ^VD^(xt, Dt_i)  , 


(6) 


where  p  is  the  gradient  step.  He  is  the  orthogonal  projec¬ 
tor  on  C,  and  the  training  set  Xi,  X2, ...  are  i.i.d.  samples 
of  the  (unknown)  distrihution  p(x).  As  shown  in  Section 
5,  we  have  observed  that  this  method  can  he  competitive 
compared  to  batch  methods  with  large  training  sets,  when 
a  good  learning  rate  p  is  selected. 


The  dictionary  learning  method  we  present  in  the  next 
section  falls  into  the  class  of  online  algorithms  based 
on  stochastic  approximations,  processing  one  sample  at  a 
time,  but  exploits  the  specific  structure  of  the  problem  to 
efficiently  solve  it.  Contrary  to  classical  first-order  stochas¬ 
tic  gradient  descent,  it  does  not  require  explicit  learning 
rate  tuning  and  minimizes  a  sequentially  quadratic  local  ap¬ 
proximations  of  the  expected  cost. 


3.  Online  Dictionary  Learning 

We  present  in  this  section  the  basic  components  of  our  on¬ 
line  algorithm  for  dictionary  learning  (Sections  3. 1-3.3),  as 
well  as  two  minor  variants  which  speed  up  our  implemen¬ 
tation  (Section  3.4). 

^We  use  “a.s.”  (almost  sure)  to  denote  convergence  with  prob¬ 
ability  one. 


Algorithm  1  Online  dictionary  learning. 

Require;  x  S  R™  ~  p{-x.)  (random  variable  and  an  algo¬ 
rithm  to  draw  i.i.d  samples  of  p),  A  G  K  (regularization 
parameter),  Dq  G  (initial  dictionary),  T  (num¬ 

ber  of  iterations). 

1:  Aq  <—  0,  Bq  <—  0  (reset  the  “past”  information). 

2:  for  f  =  1  to  T  do 

3:  Draw  Xj  from  p(x). 

4:  Sparse  coding:  compute  using  LARS 

at  =  argmin  i||xt  -  Dt_ia||2  -f  A||a||i.  (8) 


5)  -^t  ^ —  -^t _ 1 

6:  Bt  ^  Bt_i  +  Xta^. 

7:  Compute  Dt  using  Algorithm  2,  with  Dt_i  as  warm 

restart,  so  that 


Dt 


A 


.  1 
arg  mm  - 
dgC  t 

.  1 
arg  mm  - 
Dec  I 


X]  2!!^*  “  Da,||2  -b  A||at||i, 
(Tr(D^DAt)-Tr(D^Bt)X9) 


8:  end  for 

9:  Return  (learned  dictionary). 


3.1.  Algorithm  Outline 

Our  algorithm  is  summarized  in  Algorithm  1.  Assuming 
the  training  set  composed  of  i.i.d.  samples  of  a  distribu¬ 
tion  p(x),  its  inner  loop  draws  one  element  Xt  at  a  time, 
as  in  stochastic  gradient  descent,  and  alternates  classical 
sparse  coding  steps  for  computing  the  decomposition  at  of 
Xt  over  the  dictionary  Dt_i  obtained  at  the  previous  itera¬ 
tion,  with  dictionary  update  steps  where  the  new  dictionary 
Dt  is  computed  by  minimizing  over  C  the  function 

1^1 

/t(D)^  -  X2ll^*-I^«*ll2  +  ^l|a.||i,  (7) 

^  i=l 

where  the  vectors  a^  are  computed  during  the  previous 
steps  of  the  algorithm.  The  motivation  behind  our  approach 
is  twofold: 

•  The  quadratic  function  ft  aggregates  the  past  informa¬ 
tion  computed  during  the  previous  steps  of  the  algorithm, 
namely  the  vectors  a^,  and  it  is  easy  to  show  that  it  up- 
perhounds  the  empirical  cost  ft(Dt)  from  Eq.  (1).  One 
key  aspect  of  the  convergence  analysis  will  be  to  show  that 
/t(Dt)  and  /t(Dt)  converges  almost  surely  to  the  same 
limit  and  thus  ft  acts  as  a  surrogate  for  ft . 

•  Since  ft  is  close  to  ft-i,  Dt  can  he  obtained  efficiently 
using  Dt_i  as  warm  restart. 
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Algorithm  2  Dictionary  Update. 

Require;  D  =  [di, . . . ,  d^]  e  (input  dictionary), 

A=  [ai,...,afc]  G  ^  ELi 

B  =  [bi,...,bfc]GK-x'=  =  ELiX*«f- 

1:  repeat 

2:  for  j  =  1  to  /c  do 

3:  Update  the  j-th  column  to  optimize  for  (9): 


efficient.^  Since  our  algorithm  uses  the  value  of  Dt-i  as  a 
warm  restart  for  computing  D(,  a  single  iteration  has  em¬ 
pirically  been  found  to  be  enough.  Other  approaches  have 
been  proposed  to  update  D,  for  instance,  Lee  et  al.  (2007) 
suggest  using  a  Newton  method  on  the  dual  of  Eq.  (9),  but 
this  requires  inverting  ak  x  k  matrix  at  each  Newton  itera¬ 
tion,  which  is  impractical  for  an  online  algorithm. 


-Da^-)  +dj. 


d.y  < -  "  . 

max(||uj||2, 1) 


(10) 


4:  end  for 

5:  until  convergence 

6:  Return  D  (updated  dictionary). 


3.2.  Sparse  Coding 

The  sparse  coding  problem  of  Eq.  (2)  with  fixed  dictio¬ 
nary  is  an  -regularized  linear  least-squares  problem.  A 
number  of  recent  methods  for  solving  this  type  of  prob¬ 
lems  are  based  on  coordinate  descent  with  soft  threshold¬ 
ing  (Eu,  1998;  Eriedman  et  al.,  2007).  When  the  columns 
of  the  dictionary  have  low  correlation,  these  simple  meth¬ 
ods  have  proven  to  be  very  efficient.  However,  the  columns 
of  learned  dictionaries  are  in  general  highly  correlated,  and 
we  have  empirically  observed  that  a  Cholesky-based  im¬ 
plementation  of  the  LARS-Lasso  algorithm,  an  homotopy 
method  (Osborne  et  al.,  2000;  Efron  et  al.,  2004)  that  pro¬ 
vides  the  whole  regularization  path — that  is,  the  solutions 
for  all  possible  values  of  A,  can  be  as  fast  as  approaches 
based  on  soft  thresholding,  while  providing  the  solution 
with  a  higher  accuracy. 

3.3.  Dictionary  Update 

Our  algorithm  for  updating  the  dictionary  uses  block- 
coordinate  descent  with  warm  restarts,  and  one  of  its  main 
advantages  is  that  it  is  parameter-free  and  does  not  require 
any  learning  rate  tuning,  which  can  be  difficult  in  a  con¬ 
strained  optimization  setting.  Concretely,  Algorithm  2  se¬ 
quentially  updates  each  column  of  D.  Using  some  simple 
algebra,  it  is  easy  to  show  that  Eq.  (10)  gives  the  solution 
of  the  dictionary  update  (9)  with  respect  to  the  j-th  column 
dj,  while  keeping  the  other  ones  fixed  under  the  constraint 
djd,  <  1.  Since  this  convex  optimization  problem  ad¬ 
mits  separable  constraints  in  the  updated  blocks  (columns), 
convergence  to  a  global  optimum  is  guaranteed  (Bertsekas, 
1999).  In  practice,  since  the  vectors  a.^  are  sparse,  the  coef¬ 
ficients  of  the  matrix  A  are  in  general  concentrated  on  the 
diagonal,  which  makes  the  block-coordinate  descent  more 


3.4.  Optimizing  the  Algorithm 

We  have  presented  so  far  the  basic  building  blocks  of  our 
algorithm.  This  section  discusses  simple  improvements 
that  significantly  enhance  its  performance. 

Handling  Fixed-Size  Datasets.  In  practice,  although  it 
may  be  very  large,  the  size  of  the  training  set  is  often  fi¬ 
nite  (of  course  this  may  not  be  the  case,  when  the  data 
consists  of  a  video  stream  that  must  be  treated  on  the  fly 
for  example).  In  this  situation,  the  same  data  points  may 
be  examined  several  times,  and  it  is  very  common  in  on¬ 
line  algorithms  to  simulate  an  i.i.d.  sampling  of  p(x)  by 
cycling  over  a  randomly  permuted  training  set  (Bottou  & 
Bousquet,  2007).  This  method  works  experimentally  well 
in  our  setting  but,  when  the  training  set  is  small  enough, 
it  is  possible  to  further  speed  up  convergence:  In  Algo¬ 
rithm  I,  the  matrices  A*  and  Bt  carry  all  the  information 
from  the  past  coefficients  a.i, . . .  ,a.t.  Suppose  that  at  time 
fg,  a  signal  x  is  drawn  and  the  vector  a.t^  is  computed.  If 
the  same  signal  x  is  drawn  again  at  time  t  >  to,  one  would 
like  to  remove  the  “old”  information  concerning  x  from  At 
and  Bt — that  is,  write  At  <—  At_i  +  octoj  —  ottgoj^  for 
instance.  When  dealing  with  large  training  sets,  it  is  im¬ 
possible  to  store  all  the  past  coefficients  ottg ,  but  it  is  still 
possible  to  partially  exploit  the  same  idea,  by  carrying  in 
At  and  Bt  the  information  from  the  current  and  previous 
epochs  (cycles  through  the  data)  only. 

Mini-Batch  Extension.  In  practice,  we  can  improve  the 
convergence  speed  of  our  algorithm  by  drawing  -q  >  1 
signals  at  each  iteration  instead  of  a  single  one,  which  is 
a  classical  heuristic  in  stochastic  gradient  descent  algo¬ 
rithms.  Let  us  denote  xtq, . . .  ,Xt^^  the  signals  drawn  at 
iteration  t.  We  can  then  replace  the  lines  5  and  6  of  Algo¬ 
rithm  1  by 


f  At  ^ /?At_i  4- rin 
\  Bt  ^/jBt_i-fELiX«M, 

where  (3  is  chosen  so  that  /3  =  ,  where  0  =  try  if 

t  <  ry  and  +t  —  q  if  t  >  q,  which  is  compatible  with  our 
convergence  analysis. 

^Note  that  this  assumption  does  not  exactly  hold:  To  be  more 
exact,  if  a  group  of  columns  in  D  are  highly  correlated,  the  co¬ 
efficients  of  the  matrix  A  can  concentrate  on  the  corresponding 
principal  submatrices  of  A. 
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Purging  the  Dictionary  from  Unused  Atoms.  Every  dic¬ 
tionary  learning  technique  sometimes  encounters  situations 
where  some  of  the  dictionary  atoms  are  never  (or  very  sel¬ 
dom)  used,  which  happens  typically  with  a  very  bad  intial- 
ization.  A  common  practice  is  to  replace  them  during  the 
optimization  by  elements  of  the  training  set,  which  solves 
in  practice  this  problem  in  most  cases. 

4.  Convergence  Analysis 

Although  our  algorithm  is  relatively  simple,  its  stochas¬ 
tic  nature  and  the  non-convexity  of  the  objective  function 
make  the  proof  of  its  convergence  to  a  stationary  point 
somewhat  involved.  The  main  tools  used  in  our  proofs 
are  the  convergence  of  empirical  processes  (Van  der  Vaart, 
1998)  and,  following  Bottou  (1998),  the  convergence  of 
quasi-martingales  (Fisk,  1965).  Our  analysis  is  limited  to 
the  basic  version  of  the  algorithm,  although  it  can  in  prin¬ 
ciple  be  carried  over  to  the  optimized  version  discussed 
in  Section  3.4.  Because  of  space  limitations,  we  will  re¬ 
strict  ourselves  to  the  presentation  of  our  main  results  and 
a  sketch  of  their  proofs,  which  will  be  presented  in  de¬ 
tails  elsewhere,  and  first  the  (reasonable)  assumptions  un¬ 
der  which  our  analysis  holds. 

4.1.  Assumptions 

(A)  The  data  admits  a  bounded  probability  density  p 
with  compact  support  K.  Assuming  a  compact  support 
for  the  data  is  natural  in  audio,  image,  and  video  process¬ 
ing  applications,  where  it  is  imposed  by  the  data  acquisition 
process. 

(B)  The  quadratic  surrogate  functions  ft  are  strictly 
convex  with  lower-hounded  Hessians.  We  assume  that 
the  smallest  eigenvalue  of  the  semi-definite  positive  ma¬ 
trix  ^At  defined  in  Algorithm  1  is  greater  than  or  equal 
to  a  non-zero  constant  ki  (making  A*  invertible  and  ft 
strictly  convex  with  Hessian  lower-bounded).  This  hypoth¬ 
esis  is  in  practice  verified  experimentally  after  a  few  iter¬ 
ations  of  the  algorithm  when  the  initial  dictionary  is  rea¬ 
sonable,  consisting  for  example  of  a  few  elements  from  the 
training  set,  or  any  one  of  the  “off-the-shelf”  dictionaries, 
such  as  DCT  (bases  of  cosines  products)  or  wavelets.  Note 
that  it  is  easy  to  enforce  this  assumption  by  adding  a  term 
^||D  |||,  to  the  objective  function,  which  is  equivalent  in 
practice  to  replacing  the  positive  semi-definite  matrix  ^  At 
by  jAt  +  Kil.  We  have  omitted  for  simplicity  this  penal¬ 
ization  in  our  analysis. 

(C)  A  sufficient  uniqueness  condition  of  the  sparse  cod¬ 
ing  solution  is  verified:  Given  some  x  G  K,  where  K  is 
the  support  of  p,  and  D  G  C,  let  us  denote  by  A  the  set  of 
indices  j  such  that  |dj (x  —  Dq:*)|  =  A,  where  at*  is  the 
solution  of  Eq.  (2).  We  assume  that  there  exists  K2  >  0 
such  that,  for  all  x  in  AT  and  all  dictionaries  D  in  the  subset 


5  of  C  considered  by  our  algorithm,  the  smallest  eigen¬ 
value  of  DJDa  is  greater  than  or  equal  to  K2.  This  matrix 
is  thus  invertible  and  classical  results  (Fuchs,  2005)  ensure 
the  uniqueness  of  the  sparse  coding  solution.  It  is  of  course 
easy  to  build  a  dictionary  D  for  which  this  assumption  fails. 
However,  having  D^Da  invertible  is  a  common  assump¬ 
tion  in  linear  regression  and  in  methods  such  as  the  LARS 
algorithm  aimed  at  solving  Eq.  (2)  (Efron  et  al.,  2004).  It 
is  also  possible  to  enforce  this  condition  using  an  elastic 
net  penalization  (Zou  &  Hastie,  2005),  replacing  ||q:||i  by 
||a||i  -f  ^llfvlli  and  thus  improving  the  numerical  stabil¬ 
ity  of  homotopy  algorithms  such  as  LARS.  Again,  we  have 
omitted  this  penalization  for  simplicity. 

4.2.  Main  Results  and  Proof  Sketches 

Given  assumptions  (A)  to  (C),  let  us  now  show  that  our 
algorithm  converges  to  a  stationary  point  of  the  objective 
function. 

Proposition  1  (convergence  of  /(Dj)  and  of  the  sur¬ 
rogate  function).  Let  ft  denote  the  surrogate  function 
defined  in  Eq.  (7).  Under  assumptions  (A)  to  (C): 

•  ft(Dt)  converges  a.s.; 

./(D*)-/t(D0  converges  a.s.  to  0;  and 

•  f(Dt)  converges  a.s. 

Proof  sktech:  The  first  step  in  the  proof  is  to  show  that 
Dt  —  D4_i  =  which,  although  it  does  not  ensure 

the  convergence  of  Dj,  ensures  the  convergence  of  the  se¬ 
ries  w^t  -  T>t-i\\p,  a  classical  condition  in  gradi¬ 

ent  descent  convergence  proofs  (Bertsekas,  1999).  In  turn, 
this  reduces  to  showing  that  Dj  minimizes  a  parametrized 
quadratic  function  over  C  with  parameters  ^At  and 
then  showing  that  the  solution  is  uniformly  Lipschitz  with 
respect  to  these  parameters,  borrowing  some  ideas  from 
perturbation  theory  (Bonnans  &  Shapiro,  1998).  At  this 
point,  and  following  Bottou  (1998),  proving  the  conver¬ 
gence  of  the  sequence  /((Dj)  amounts  to  showing  that  the 
stochastic  positive  process 

Ut  =  M^t)  >  0,  (12) 

is  a  quasi-martingale.  To  do  so,  denoting  by  Tt  the  filtra¬ 
tion  of  the  past  information,  a  theorem  by  Fisk  (1965)  states 
that  if  the  positive  sum  ]E[max(E[itt+i  —  Ut\J-t],(f)\ 
converges,  then  ut  is  a  quasi-martingale  which  converges 
with  probability  one.  Using  some  results  on  empirical  pro¬ 
cesses  (Van  der  Vaart,  1998,  Chap.  19.2,  Donsker  The¬ 
orem),  we  obtain  a  bound  that  ensures  the  convergence 
of  this  series.  It  follows  from  the  convergence  of  Ut  that 
/t(Dt)  —  /t(Dt)  converges  to  zero  with  probability  one. 
Then,  a  classical  theorem  from  perturbation  theory  (Bon¬ 
nans  &  Shapiro,  1998,  Theorem  4.1)  shows  that  Z(x,  D) 
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is  .  This,  allows  us  to  use  a  last  result  on  empirical 
processes  ensuring  that  f(Dt)  —  ft(Dt)  converges  almost 
surely  to  0.  Therefore  f(Dt)  converges  as  well  with  prob¬ 
ability  one.  ■ 

Proposition  2  (convergence  to  a  stationary  point).  Un¬ 
der  assumptions  (A)  to  (C),  Dj  is  asymptotically  close  to 
the  set  of  stationary  points  of  the  dictionary  learning  prob¬ 
lem  with  probability  one. 

Proof  sktech:  The  first  step  in  the  proof  is  to  show  using 
classical  analysis  tools  that,  given  assumptions  (A)  to  (C), 
/  is  with  a  Lipschitz  gradient.  Considering  A  and  B 
two  accumulation  points  of  |  At  and  |Bt  respectively,  we 
can  define  the  corresponding  surrogate  function  fao  such 
that  for  all  D  in  C,  /00(D)  =  Tr(D'^DA)  -  Tr(D^B), 
and  its  optimum  Dqq  on  C.  The  next  step  consists  of  show¬ 
ing  that  V/oo(Doo)  =  V/(Doo)  and  that -V/(Doo)  is  in 
the  normal  cone  of  the  set  C — that  is,  Doo  is  a  stationary 
point  of  the  dictionary  learning  problem  (Borwein  & 
Lewis,  2006).  ■ 

5.  Experimental  Validation 

In  this  section,  we  present  experiments  on  natural  images 
to  demonstrate  the  efficiency  of  our  method. 

5.1.  Performance  evaluation 

For  our  experiments,  we  have  randomly  selected  1.25  x 
10®  patches  from  images  in  the  Berkeley  segmentation 
dataset  (Martin  et  al.,  2001),  which  is  a  standard  image 
database;  10®  of  these  are  kept  for  training,  and  the  rest  for 
testing.  We  used  these  patches  to  create  three  datasets  A, 
B,  and  C  with  increasing  patch  and  dictionary  sizes  repre¬ 
senting  various  typical  settings  in  image  processing  appli¬ 
cations: 


Data 

Signal  size  m 

Nb  k  of  atoms 

Type 

A 

8  X  8  =  64 

256 

b&w 

B 

12  X  12  X  3  =  432 

512 

color 

C 

16  X  16  =  256 

1024 

b&w 

We  have  normalized  the  patches  to  have  unit  £2 -norm  and 
used  the  regularization  parameter  A  =  1.2/1/77)  in  all  of 
our  experiments.  The  Ijyjm  term  is  a  classical  normaliza¬ 
tion  factor  (Bickel  et  al.,  2007),  and  the  constant  1.2  has 
been  experimentally  shown  to  yield  reasonable  sparsities 
(about  10  nonzero  coefficients)  in  these  experiments.  We 
have  implemented  the  proposed  algorithm  in  C-i-H-  with  a 
Matlab  interface.  All  the  results  presented  in  this  section 
use  the  mini-batch  refinement  from  Section  3.4  since  this 
has  shown  empirically  to  improve  speed  by  a  factor  of  10 
or  more.  This  requires  to  tune  the  parameter  77,  the  number 
of  signals  drawn  at  each  iteration.  Trying  different  powers 


of  2  for  this  variable  has  shown  that  77  =  256  was  a  good 
choice  (lowest  objective  function  values  on  the  training  set 
—  empirically,  this  setting  also  yields  the  lowest  values  on 
the  test  set),  but  values  of  128  and  and  512  have  given  very 
similar  performances. 

Our  implementation  can  be  used  in  both  the  online  setting 
it  is  intended  for,  and  in  a  regular  batch  mode  where  it 
uses  the  entire  dataset  at  each  iteration  (corresponding  to 
the  mini-batch  version  with  rj  =  n).  We  have  also  imple¬ 
mented  a  first-order  stochastic  gradient  descent  algorithm 
that  shares  most  of  its  code  with  our  algorithm,  except 
for  the  dictionary  update  step.  This  setting  allows  us  to 
draw  meaningful  comparisons  between  our  algorithm  and 
its  batch  and  stochastic  gradient  alternatives,  which  would 
have  been  difficult  otherwise.  For  example,  comparing  our 
algorithm  to  the  Matlab  implementation  of  the  batch  ap¬ 
proach  from  (Lee  et  al.,  2007)  developed  by  its  authors 
would  have  been  unfair  since  our  C-n-  program  has  a  built- 
in  speed  advantage.  Although  our  implementation  is  multi¬ 
threaded,  our  experiments  have  been  run  for  simplicity  on  a 
single-CPU,  single-core  2.4Ghz  machine.  To  measure  and 
compare  the  performances  of  the  three  tested  methods,  we 
have  plotted  the  value  of  the  objective  function  on  the  test 
set,  acting  as  a  surrogate  of  the  expected  cost,  as  a  function 
of  the  corresponding  training  time. 

Online  vs  Batch.  Figure  1  (top)  compares  the  online  and 
batch  settings  of  our  implementation.  The  full  training  set 
consists  of  10®  samples.  The  online  version  of  our  algo¬ 
rithm  draws  samples  from  the  entire  set,  and  we  have  run 
its  batch  version  on  the  full  dataset  as  well  as  subsets  of  size 
10^  and  10®  (see  figure).  The  online  setting  systematically 
outperforms  its  batch  counterpart  for  every  training  set  size 
and  desired  precision.  We  use  a  logarithmic  scale  for  the 
computation  time,  which  shows  that  in  many  situations,  the 
difference  in  performance  can  be  dramatic.  Similar  experi¬ 
ments  have  given  similar  results  on  smaller  datasets. 

Comparison  with  Stochastic  Gradient  Descent.  Our  ex¬ 
periments  have  shown  that  obtaining  good  performance 
with  stochastic  gradient  descent  requires  using  both  the 
mini-batch  heuristic  and  carefully  choosing  the  learning 
rate  p.  To  give  the  fairest  comparison  possible,  we  have 
thus  optimized  these  parameters,  sampling  77  values  among 
powers  of  2  (as  before)  and  p  values  among  powers  of  10. 
The  combination  of  values  p  =  10^,  77  =  512  gives  the 
best  results  on  the  training  and  test  data  for  stochastic  gra¬ 
dient  descent.  Figure  1  (bottom)  compares  our  method  with 
stochastic  gradient  descent  for  different  p  values  around 
10^  and  a  fixed  value  of  77  =  512.  We  observe  that  the 
larger  the  value  of  p  is,  the  better  the  eventual  value  of  the 
objective  function  is  after  many  iterations,  but  the  longer  it 
will  take  to  achieve  a  good  precision.  Although  our  method 
performs  better  at  such  high-precision  settings  for  dataset 
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Evaluation  set  A  Evaluation  set  B  Evaluation  set  C 


Figure  1.  Top:  Comparison  between  online  and  batch  learning  for  various  training  set  sizes.  Bottom:  Comparison  between  our  method 
and  stochastic  gradient  (SG)  descent  with  different  learning  rates  p.  In  both  cases,  the  value  of  the  objective  function  evaluated  on  the 
test  set  is  reported  as  a  function  of  computation  time  on  a  logarithmic  scale.  Values  of  the  objective  function  greater  than  its  initial  value 
are  truncated. 


C,  it  appears  that,  in  general,  for  a  desired  precision  and  a 
particular  dataset,  it  is  possible  to  tune  the  stochastic  gra¬ 
dient  descent  algorithm  to  achieve  a  performance  similar 
to  that  of  our  algorithm.  Note  that  both  stochastic  gradi¬ 
ent  descent  and  our  method  only  start  decreasing  the  ob¬ 
jective  function  value  after  a  few  iterations.  Slightly  better 
results  could  be  obtained  by  using  smaller  gradient  steps 
during  the  first  iterations,  using  a  learning  rate  of  the  form 
p/{t  +  to)  for  the  stochastic  gradient  descent,  and  initializ¬ 
ing  Aq  =  fol  and  Bq  =  foDo  for  the  matrices  At  and  Bt, 
where  to  is  a  new  parameter. 

5.2.  Application  to  Inpainting 

Our  last  experiment  demonstrates  that  our  algorithm  can 
be  used  for  a  difficult  large-scale  image  processing  task, 
namely,  removing  the  text  (inpainting)  from  the  damaged 
12-Megapixel  image  of  Figure  2.  Using  a  multi-threaded 
version  of  our  implementation,  we  have  learned  a  dictio¬ 
nary  with  256  elements  from  the  roughly  7  x  10®  undam¬ 
aged  12x12  color  patches  in  the  image  with  two  epochs  in 
about  500  seconds  on  a  2.4GHz  machine  with  eight  cores. 
Once  the  dictionary  has  been  learned,  the  text  is  removed 
using  the  sparse  coding  technique  for  inpainting  of  Mairal 
et  al.  (2008b).  Our  intent  here  is  of  course  not  to  evaluate 
our  learning  procedure  in  inpainting  tasks,  which  would  re¬ 
quire  a  thorough  comparison  with  state-the-art  techniques 
on  standard  datasets.  Instead,  we  just  wish  to  demonstrate 
that  the  proposed  method  can  indeed  be  applied  to  a  re¬ 


alistic,  non-trivial  image  processing  task  on  a  large  im¬ 
age.  Indeed,  to  the  best  of  our  knowledge,  this  is  the  first 
time  that  dictionary  learning  is  used  for  image  restoration 
on  such  large-scale  data.  For  comparison,  the  dictionaries 
used  for  inpainting  in  the  state-of-the-art  method  of  Mairal 
et  al.  (2008b)  are  learned  (in  batch  mode)  on  only  200,000 
patches. 

6.  Discussion 

We  have  introduced  in  this  paper  a  new  stochastic  online  al¬ 
gorithm  for  learning  dictionaries  adapted  to  sparse  coding 
tasks,  and  proven  its  convergence.  Preliminary  experiments 
demonstrate  that  it  is  significantly  faster  than  batch  alterna¬ 
tives  on  large  datasets  that  may  contain  millions  of  training 
examples,  yet  it  does  not  require  learning  rate  tuning  like 
regular  stochastic  gradient  descent  methods.  More  exper¬ 
iments  are  of  course  needed  to  better  assess  the  promise 
of  this  approach  in  image  restoration  tasks  such  as  denois- 
ing,  deblurring,  and  inpainting.  Beyond  this,  we  plan  to 
use  the  proposed  learning  framework  for  sparse  coding  in 
computationally  demanding  video  restoration  tasks  (Prot- 
ter  &  Elad,  2009),  with  dynamic  datasets  whose  size  is  not 
fixed,  and  also  plan  to  extend  this  framework  to  different 
loss  functions  to  address  discriminative  tasks  such  as  im¬ 
age  classification  (Mairal  et  al.,  2008a),  which  are  more 
sensitive  to  overfitting  than  reconstructive  ones,  and  vari¬ 
ous  matrix  factorization  tasks,  such  as  non-negative  matrix 
factorization  with  sparseness  constraints. 
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Figure  2.  Inpainting  example  on  a  12-Megapixel  image.  Top: 
Damaged  and  restored  images.  Bottom:  Zooming  on  the  dam¬ 
aged  and  restored  images.  (Best  seen  in  color) 
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